【Hackathon 7th No.1】 Integrate PaddlePaddle as a New Backend #704

Open · wants to merge 3 commits into master

Conversation

AndPuQing

This PR introduces PaddlePaddle backend support to the PySR library as part of my contribution to the PaddlePaddle Hackathon 7th.

Key Changes:

  • Introduced PaddlePaddle-specific modules and functions.
  • Updated the CI pipeline to include tests for the PaddlePaddle backend.
  • Modified the example code to support PaddlePaddle.

Related Issues:

Issue Encountered:

During my integration process, I discovered that the call to SymbolicRegression.equation_search is terminated by a SIGSEGV signal, as shown in the following code snippet:

class PySRRegressor(MultiOutputMixin, RegressorMixin, BaseEstimator):
    def _run(self, ...):
        ...
        out = SymbolicRegression.equation_search(...)  # <-- terminated by signal SIGSEGV (address boundary error)
        ...

Interestingly, this error does not occur when I set PYTHON_JULIACALL_THREADS=1. Below are some of the steps I took to investigate the issue:

After the termination, I checked dmesg, which returned the following:

[376826.886941] R13: 00007f3605750ba0 R14: 00007f35fe3ae8b0 R15: 00007f35fe3ae920
[376826.886942] R13: 00007f3605750ba0 R14: 00007f35fe3addc0 R15: 00007f35fe3ade30
[376826.886942] FS:  00007f358b7ff6c0 GS:  0000000000000000
[376826.886943] FS:  00007f35b60fc6c0 GS:  0000000000000000
[376826.886941] FS:  00007f3572ffe6c0 GS:  0000000000000000
[376826.886943] RBP: 00007f330b08ea10 R08: 00000000ffffffff R09: 00007f3605dc2dc8
[376826.886944] FS:  00007f3588ec06c0 GS:  0000000000000000
[376826.886943] RAX: 00007f360698f008 RBX: 00007f3605750b00 RCX: 0000000000000000
[376826.886945] FS:  00007f354f1fe6c0 GS:  0000000000000000
[376826.886945] R10: 00007f3605dc2820 R11: 00007f3605326040 R12: 00007f35fe3af6c0
[376826.886945] FS:  00007f359d3fc6c0 GS:  0000000000000000
[376826.886945] RDX: 0000000000000001 RSI: 00007f3605750b00 RDI: 0000000000000000
[376826.886946] R13: 00007f3605750ba0 R14: 00007f35fe3af6c0 R15: 00007f35fe3af730
[376826.886962] FS:  00007f357bfff6c0 GS:  0000000000000000
[376826.886962] RBP: 00007f330e88ea10 R08: 00000000ffffffff R09: 00007f3605dc2dc8
[376826.886963] R10: 00007f3605dc2820 R11: 00007f3605326040 R12: 00007f35fe3ae270
[376826.886964] R13: 00007f3605750ba0 R14: 00007f35fe3ae270 R15: 00007f35fe3ae2e0
[376826.886965] FS:  00007f35709fa6c0 GS:  0000000000000000
[376826.887626]  in libjulia-internal.so.1.10.4[7f3605453000+1c1000]
[376826.888187]  in libjulia-internal.so.1.10.4[7f3605453000+1c1000]

I also used gdb --args python -m pysr test paddle to inspect the stack trace. The output is as follows:

➜ gdb --args python -m pysr test paddle
...
(gdb) run
W0824 18:25:46.120416  7735 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.6, Runtime API Version: 12.3
W0824 18:25:46.129973  7735 gpu_resources.cc:164] device: 0, cuDNN Version: 9.0.
.Compiling Julia backend...

Thread 6 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd61fd6c0 (LWP 7882)]
0x00007ffff5abaab7 in _jl_mutex_wait (self=self@entry=0x7fffef5b8e20, lock=lock@entry=0x7ffff5d50b00 <jl_codegen_lock>, safepoint=safepoint@entry=1) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c:837
warning: 837    /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c: No such file or directory
(gdb) bt
#0  0x00007ffff5abaab7 in _jl_mutex_wait (self=self@entry=0x7fffef5b8e20, lock=lock@entry=0x7ffff5d50b00 <jl_codegen_lock>, safepoint=safepoint@entry=1) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c:837
#1  0x00007ffff5abab63 in _jl_mutex_lock (self=self@entry=0x7fffef5b8e20, lock=0x7ffff5d50b00 <jl_codegen_lock>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c:875
#2  0x00007ffff5926100 in jl_mutex_lock (lock=<optimized out>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/julia_locks.h:65
#3  jl_generate_fptr_impl (mi=0x7ffe4c8c8d10 <jl_system_image_data+6611984>, world=31536, did_compile=0x7ffcf408eabc) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/jitlayers.cpp:483
#4  0x00007ffff5a691b9 in jl_compile_method_internal (world=31536, mi=<optimized out>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2481
#5  jl_compile_method_internal (mi=<optimized out>, world=31536) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2368
#6  0x00007ffff5a6a1be in _jl_invoke (world=31536, mfunc=0x7ffe4c8c8d10 <jl_system_image_data+6611984>, nargs=3, args=0x7ffcf408ebe0, F=0x7ffe4c566390 <jl_system_image_data+3062416>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2887
#7  ijl_apply_generic (F=<optimized out>, args=0x7ffcf408ebe0, nargs=<optimized out>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:3077
#8  0x00007ffe4c23a44f in macro expansion () at /home/happy/.julia/packages/SymbolicRegression/9q4ZC/src/SingleIteration.jl:123
#9  julia_#271#threadsfor_fun#4_21198 () at threadingconstructs.jl:215
#10 0x00007ffe4c0d7cd0 in #271#threadsfor_fun () at threadingconstructs.jl:182
#11 julia_#1_24737 () at threadingconstructs.jl:154
#12 0x00007ffe4c123847 in jfptr_YY.1_24738 () from /home/happy/.julia/compiled/v1.10/SymbolicRegression/X2eIS_URG6E.so
#13 0x00007ffff5a69f8e in _jl_invoke (world=<optimized out>, mfunc=0x7ffe4c856190 <jl_system_image_data+6142096>, nargs=0, args=0x7fffef5b8e58, F=0x7ffe59c89110) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2895
#14 ijl_apply_generic (F=<optimized out>, args=args@entry=0x7fffef5b8e58, nargs=nargs@entry=0) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:3077
#15 0x00007ffff5a8d310 in jl_apply (nargs=1, args=0x7fffef5b8e50) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/julia.h:1982
#16 start_task () at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/task.c:1238
(gdb)

To be honest, I’m not very familiar with Julia, but it seems that the issue is related to multithreading within the SymbolicRegression library. I ran similar tests with the Torch backend, and below are the results:

➜ gdb --args python -m pysr test torch
(gdb) run
Thread 31 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff4dcc06c0 (LWP 9082)]
0x00007ffff5abaab7 in _jl_mutex_wait (self=self@entry=0x7fffee9addc0, lock=lock@entry=0x7ffff5d50b00 <jl_codegen_lock>, safepoint=safepoint@entry=1) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c:837
warning: 837    /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c: No such file or directory
(gdb) bt
#0  0x00007ffff5abaab7 in _jl_mutex_wait (self=self@entry=0x7fffee9addc0, lock=lock@entry=0x7ffff5d50b00 <jl_codegen_lock>, safepoint=safepoint@entry=1) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c:837
#1  0x00007ffff5abab63 in _jl_mutex_lock (self=self@entry=0x7fffee9addc0, lock=0x7ffff5d50b00 <jl_codegen_lock>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c:875
#2  0x00007ffff5926100 in jl_mutex_lock (lock=<optimized out>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/julia_locks.h:65
#3  jl_generate_fptr_impl (mi=0x7ffe56d90560, world=31536, did_compile=0x7ffd82c6207c) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/jitlayers.cpp:483
#4  0x00007ffff5a691b9 in jl_compile_method_internal (world=31536, mi=<optimized out>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2481
#5  jl_compile_method_internal (mi=<optimized out>, world=31536) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2368
#6  0x00007ffff5a6a1be in _jl_invoke (world=31536, mfunc=0x7ffe56d90560, nargs=2, args=0x7ffd82c621b0, F=0x7fffe22c81b0 <jl_system_image_data+3615344>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2887
#7  ijl_apply_generic (F=<optimized out>, args=0x7ffd82c621b0, nargs=<optimized out>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:3077
#8  0x00007ffe4bfe13c3 in julia_optimize_and_simplify_population_21133 () at /home/happy/.julia/packages/SymbolicRegression/9q4ZC/src/SingleIteration.jl:110

#30 0x00007fffe6459970 in jl_system_image_data () from /home/happy/micromamba/julia_env/pyjuliapkg/install/lib/julia/sys.so
#31 0x00007fffe2c14290 in jl_system_image_data () from /home/happy/micromamba/julia_env/pyjuliapkg/install/lib/julia/sys.so
#32 0x0000000000000001 in ?? ()
#33 0x00007ffff5a72617 in jl_smallintset_lookup (cache=<optimized out>, eq=0x40, key=0x14, data=0x7ffe4f3d0010, hv=3) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/smallintset.c:121
#34 0x00007fffdf4f82e5 in ?? ()
#35 0x00007ffe4f414218 in ?? ()
#36 0x00007fffef48c890 in ?? ()
#37 0x00007fffebb7f8f0 in ?? ()
#38 0x00007ffe4f3d0010 in ?? ()
#39 0x00007ffd82c62a40 in ?? ()
#40 0x00007ffff592846a in __gthread_mutex_unlock (__mutex=0x1) at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/x86_64-linux-gnu/bits/gthr-default.h:779
#41 std::mutex::unlock (this=0x1) at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/bits/std_mutex.h:118
#42 std::lock_guard<std::mutex>::~lock_guard (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/bits/std_mutex.h:165
#43 JuliaOJIT::ResourcePool<llvm::orc::ThreadSafeContext, 0ul, std::queue<llvm::orc::ThreadSafeContext, std::deque<llvm::orc::ThreadSafeContext, std::allocator<llvm::orc::ThreadSafeContext> > > >::release (resource=..., this=0x7ffd82c62ad0)
    at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/jitlayers.h:462
#44 JuliaOJIT::ResourcePool<llvm::orc::ThreadSafeContext, 0ul, std::queue<llvm::orc::ThreadSafeContext, std::deque<llvm::orc::ThreadSafeContext, std::allocator<llvm::orc::ThreadSafeContext> > > >::OwningResource::~OwningResource (this=0x0, __in_chrg=<optimized out>)
    at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/jitlayers.h:404
#45 0x00007fffdf4f83c4 in ?? ()
#46 0x0000000000000010 in ?? ()
#47 0x00007ffd82c62c60 in ?? ()
#48 0x0000000000000000 in ?? ()
(gdb)

I am encountering some challenges understanding the internal behavior of the SymbolicRegression library. I would greatly appreciate any guidance or suggestions on how to resolve this issue.

@MilesCranmer
Owner

MilesCranmer commented Aug 24, 2024

Thanks! Regarding the error, is this only on the most recent PySR version, or the previous one as well? There was a change in how multithreading was handled in juliacall so I wonder if it’s related.

Just to check, are you launching this with Python multithreading or multiprocessing? I am assuming it is a bug somewhere in PySR but just want to check if there’s anything non standard in how you are launching things.

Another thing to check — does it change if you import paddle first, before PySR? Or does the import order not matter? There is a known issue with PyTorch and JuliaCall where importing torch first prevents an issue with LLVM symbol conflicts (since Torch and Julia are compiled against different LLVM libraries if I remember correctly). Numba has something similar. Not sure if Paddle has something similar
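
A minimal sketch of that import-order check (the only assumption is that nothing else in the process has initialized Julia or Paddle before these lines run):

# Variant A: load Paddle (and the LLVM it links against) first.
import paddle
from pysr import PySRRegressor  # importing pysr pulls in juliacall, which starts Julia

# Variant B, for comparison, is the reverse order:
# from pysr import PySRRegressor
# import paddle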

@AndPuQing
Author

AndPuQing commented Aug 26, 2024

> Thanks! Regarding the error, is this only on the most recent PySR version, or the previous one as well? There was a change in how multithreading was handled in juliacall so I wonder if it’s related.
>
> Just to check, are you launching this with Python multithreading or multiprocessing? I am assuming it is a bug somewhere in PySR but just want to check if there’s anything non standard in how you are launching things.
>
> Another thing to check — does it change if you import paddle first, before PySR? Or does the import order not matter? There is a known issue with PyTorch and JuliaCall where importing torch first prevents an issue with LLVM symbol conflicts (since Torch and Julia are compiled against different LLVM libraries if I remember correctly). Numba has something similar. Not sure if Paddle has something similar

  1. I tested both the 0.18.4 and 0.18.1 release versions, and the issue occurs with both.
  2. I ran tests using pytest pysr/test/test_paddle.py -vv, and my Python version is 3.10.13.
  3. I adjusted the import order in test_paddle.py, but the issue persists regardless of whether pysr or paddle is imported first, so it doesn't seem related to the import order.
  4. As far as I know, Paddle depends on LLVM, which can be confirmed in the llvm.cmake file.

I believe the issue is indeed related to importing Paddle, as the error only occurs when SymbolicRegression.equation_search is executed after importing Paddle. It seems that importing Paddle might cause some side effects, but I'm not sure how to pinpoint the exact cause.

@MilesCranmer
Owner

Quick followup – did you also try 0.19.4? There were some changes to juliacall that were integrated in 0.19.4 of PySR.

@MilesCranmer
Owner

MilesCranmer commented Aug 26, 2024

It looks like Julia uses LLVM 18: https://github.com/JuliaLang/julia/blob/647753071a1e2ddbddf7ab07f55d7146238b6b72/deps/llvm.version#L8 (or LLVM 17 on the last version) whereas Paddle uses LLVM 11. I wonder if that is causing one of the issues.

Can you try running some generic juliacall stuff as shown in the guide here? https://juliapy.github.io/PythonCall.jl/stable/juliacall/. Hopefully we should be able to get a simpler example of the crash. I would be surprised if it is only PySR (and not Julia more broadly) but it could very well be just PySR.

@AndPuQing
Author

> Quick followup – did you also try 0.19.4? There were some changes to juliacall that were integrated in 0.19.4 of PySR.

It still occurs.

@AndPuQing
Author

> It looks like Julia uses LLVM 18: https://github.com/JuliaLang/julia/blob/647753071a1e2ddbddf7ab07f55d7146238b6b72/deps/llvm.version#L8 (or LLVM 17 on the last version) whereas Paddle uses LLVM 11. I wonder if that is causing one of the issues.
>
> Can you try running some generic juliacall stuff as shown in the guide here? https://juliapy.github.io/PythonCall.jl/stable/juliacall/. Hopefully we should be able to get a simpler example of the crash. I would be surprised if it is only PySR (and not Julia more broadly) but it could very well be just PySR.

I created a minimal reproducible example, as shown in the code below.

import numpy as np
import pandas as pd
from pysr import PySRRegressor
import paddle

paddle.disable_signal_handler()
X = pd.DataFrame(np.random.randn(100, 10))
y = np.ones(X.shape[0])
model = PySRRegressor(
    progress=True,
    max_evals=10000,
    model_selection="accuracy",
    extra_sympy_mappings={},
    output_paddle_format=True,
    # multithreading=True,
)
model.fit(X, y)

Interestingly, when running PySRRegressor with either procs=0 (serial execution) or procs=cpu_count() with multithreading=False (multiprocessing), no issues occur. Additionally, setting PYTHON_JULIACALL_THREADS=1 allows multithreading to run without problems.

To summarize: when the environment variable PYTHON_JULIACALL_THREADS is set to auto or to a value greater than 1, and SymbolicRegression runs in multithreading mode, the equation_search function terminates unexpectedly.
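
For reference, a minimal sketch of the single-thread workaround implied by this summary (an assumption here is that PYTHON_JULIACALL_THREADS must be set before pysr/juliacall is first imported, since the Julia runtime reads it at startup; output_paddle_format is the parameter added by this PR):

import os

# Limit the embedded Julia runtime to one thread; as noted above, the
# crash was not observed with PYTHON_JULIACALL_THREADS=1.
os.environ["PYTHON_JULIACALL_THREADS"] = "1"

import numpy as np
import pandas as pd
import paddle
from pysr import PySRRegressor

paddle.disable_signal_handler()

X = pd.DataFrame(np.random.randn(100, 10))
y = np.ones(X.shape[0])
model = PySRRegressor(max_evals=10000, output_paddle_format=True)
model.fit(X, y)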

@MilesCranmer
Owner

Are you able to build a MWE that only uses juliacall, rather than PySR? We should hopefully be able to boil it down even further. Maybe you could do something like

from juliacall import Main as jl
import paddle
paddle.disable_signal_handler()
jl.seval("""
Threads.@threads for i in 1:5
    println(i)
end
""")

which is like the simplest way of using multithreading in Julia.

@AndPuQing
Author

> Are you able to build a MWE that only uses juliacall, rather than PySR? We should hopefully be able to boil it down even further. Maybe you could do something like
>
> from juliacall import Main as jl
> import paddle
> paddle.disable_signal_handler()
> jl.seval("""
> Threads.@threads for i in 1:5
>     println(i)
> end
> """)
>
> which is like the simplest way of using multithreading in Julia.

It works as expected:

➜ python test.py
1
2
3
4
5

@MilesCranmer
Owner

MilesCranmer commented Aug 27, 2024

Since those numbers appear in order it seems like you might not have multi-threading turned on for Julia? Normally they will appear in some random order as each will get printed by a different thread. (Note the environment variables I mentioned above.)
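
A quick way to verify that (a sketch; it assumes the environment variable is set before juliacall is imported for the first time in the process):

import os

os.environ["PYTHON_JULIACALL_THREADS"] = "auto"  # must be set before the first juliacall import

from juliacall import Main as jl

# If this prints 1, Julia was started single-threaded and the ordered
# output above is expected rather than evidence of multiple threads.
print(jl.seval("Threads.nthreads()"))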

@AndPuQing
Author

> Since those numbers appear in order it seems like you might not have multi-threading turned on for Julia? Normally they will appear in some random order as each will get printed by a different thread. (Note the environment variables I mentioned above.)

➜ PYTHON_JULIACALL_HANDLE_SIGNALS=yes PYTHON_JULIACALL_THREADS=auto python test.py
1
3
5
2
4

@lijialin03

Excuse me, is the current problem that Paddle does not support multithreading for Julia unless PYTHON_JULIACALL_HANDLE_SIGNALS=yes and PYTHON_JULIACALL_THREADS=auto are set?
In addition, I tried the following code without Paddle:

from juliacall import Main as jl
jl.seval("""
Threads.@threads for i in 1:5
    println(i)
end
""")

and the result is also 1 2 3 4 5.

@lijialin03

Excuse me, is this issue still being followed up?

@MilesCranmer
Owner

Hi,
I’m not sure there is anything to do from my end, but I’m happy to help with specific things. The last thing discussed was whether a minimal working example could be created that demonstrates the same issues @AndPuQing observed.
But maybe if you try again, the issue will turn out to have been fixed on the PaddlePaddle side by now?
Cheers,
Miles

@MilesCranmer
Owner

Also, @lijialin03 feel free to make a new PR if interested as I’m not sure if @AndPuQing is still working on this one.

@lijialin03

> Also, @lijialin03 feel free to make a new PR if interested as I’m not sure if @AndPuQing is still working on this one.

Ok, I understand. Thank you for your reply.

@AndPuQing
Author

When the environment variable PYTHON_JULIACALL_HANDLE_SIGNALS is set to yes and PYTHON_JULIACALL_THREADS is set to auto or any value greater than 1, Julia terminates with a SIGSEGV signal.

Below is a minimal reproducible example to demonstrate the issue:

# test.py
import paddle
from pysr import jl


paddle.disable_signal_handler()

jl.seval(
    """
using Distributed: Distributed, @spawnat, Future, procs, addprocs

macro sr_spawner(expr, kws...)
    # Extract parallelism and worker_idx parameters from kws
    @assert length(kws) == 2
    @assert all(ex -> ex.head == :(=), kws)
    @assert any(ex -> ex.args[1] == :parallelism, kws)
    @assert any(ex -> ex.args[1] == :worker_idx, kws)
    parallelism = kws[findfirst(ex -> ex.args[1] == :parallelism, kws)::Int].args[2]
    worker_idx = kws[findfirst(ex -> ex.args[1] == :worker_idx, kws)::Int].args[2]
    return quote
        if $(parallelism) == :multithreading
            Threads.@spawn($(expr))
        else
            error("Invalid parallel type ", string($(parallelism)), ".")
        end
    end |> esc
end

@sr_spawner begin
    println("Hello, world!")
end parallelism=:multithreading worker_idx=1
"""
)
Run it with:

PYTHON_JULIACALL_HANDLE_SIGNALS=yes PYTHON_JULIACALL_THREADS=auto python test.py

Interestingly, this problem is not 100% reproducible and the script may need to be run multiple times (it looks like the result of some kind of race between threads).

Currently, this PR passes the unit tests with the default settings; the error is only encountered when multithreading=True.

@MilesCranmer
Owner

Since this reduces the problem to only an interaction between PaddlePaddle and PythonCall.jl, without PySR being necessary to get this MWE, perhaps we best copy this issue to both the PaddlePaddle repository AND to the PythonCall.jl repository? (Noting that Distributed.jl is in the Julia standard library)

I think the code can also be reduced further - there's no need for the sr_spawner macro; all it does is generate code. You can use @macroexpand to see the resulting code and just use that instead.
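
For instance, a sketch of what that reduction could look like (assuming the :multithreading branch of @sr_spawner expands to a bare Threads.@spawn of the body; the file name is illustrative), run with the same PYTHON_JULIACALL_HANDLE_SIGNALS=yes PYTHON_JULIACALL_THREADS=auto environment:

# reduced_test.py (illustrative name)
import paddle
from juliacall import Main as jl

paddle.disable_signal_handler()

# Roughly the code @macroexpand reports for the :multithreading branch:
# a plain Threads.@spawn of the body, with no wait, as in the MWE above.
jl.seval("""
Threads.@spawn println("Hello, world!")
""")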

@lijialin03

> SIGSEGV

When I change the code to the following, to make sure the printed output is actually displayed:

future = Threads.@spawn($(expr))
wait(future)

No errors were encountered. And when I use the code below to check the number of threads, it is equal to PYTHON_JULIACALL_THREADS.

nthreads = jl.Threads.nthreads()
print(f"Julia is running with {nthreads} threads.")

What is more, when I change the code to the version below, the number of "Hello, world!" lines printed is also correct.

futures = [Threads.@spawn($(expr)) for _ in 1:Threads.nthreads()]
for future in futures
    wait(future)
end
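
For reference, a self-contained version of that last variant (a sketch; it assumes the $(expr) placeholder stands for the println body from the earlier MWE, and it still needs PYTHON_JULIACALL_THREADS set to more than one thread):

import paddle
from juliacall import Main as jl

paddle.disable_signal_handler()

# Spawn one task per Julia thread and wait for all of them, as in the
# fragment above, so every "Hello, world!" is flushed before exit.
jl.seval("""
futures = [Threads.@spawn(println("Hello, world!")) for _ in 1:Threads.nthreads()]
for future in futures
    wait(future)
end
""")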

So, what is your environment? Could it be causing the problem?
My environment is:

Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal

Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

paddlepaddle-gpu==3.0.0b1
julia> VERSION
v"1.10.7"
julia> Base.libllvm_version
v"15.0.7"

@MilesCranmer
Owner

Could you post this in the PythonCall.jl issues and also the Paddle ones? Since it doesn't seem to be caused by PySR itself, it's probably best to move the issue to one of those repositories (you can cc me on those issues).
